This lecture was inspired by and/or modified in part from “Text Mining Fedspeak” by Len Kiefer
The Federal Reserve is a natural target of text mining for economists. The Federal Open Market Committee (FOMC) monetary policy statement is parsed and prodded each time the FOMC announces a change (or no change at all). For example, the Wall Street Journal provides a Fed Statement Tracker, which allows you to compare changes from one FOMC statement to another. Narasimhan Jegadeesh and Di Wu have a paper, “Deciphering Fedspeak: The Information Content of FOMC Meetings” (available on SSRN), that uses text mining techniques on FOMC meeting minutes.
Researchers have also looked at the transcripts of FOMC meetings. San Cannon has a paper, “Sentiment of the FOMC Unscripted” (pdf), that applies text mining tools to FOMC transcripts.
We’ll look at the Federal Reserve’s semi-annual Monetary Policy Report. This report is typically issued in February and July, with the latest report for July 2022. We can download pdf files for each July report from 1996 through 2022, though the url form has changed slightly.
Let’s first load in a single report for July 2022, available at: https://www.federalreserve.gov/monetarypolicy/files/20220617_mprfullreport.pdf
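Before any of the code below will run, the packages it relies on need to be loaded and fed_links needs to exist. A minimal, hedged sketch follows: it assumes the report PDFs were already downloaded into a local ./pdfs/ folder with date-stamped names like 2022_06_17.pdf (both the folder name and the naming pattern are assumptions inferred from the path cleaning applied to fed_links later on).

# Packages used throughout this lecture
library(tidyverse)   # dplyr, tidyr, stringr, purrr, ggplot2
library(pdftools)    # pdf_text()
library(tidytext)    # unnest_tokens(), stop_words, get_sentiments()
library(lubridate)   # as_date()
library(DT)          # datatable()
library(ggridges)    # theme_ridges()
library(furrr)       # future_map()
library(widyr)       # pairwise_cor()
library(igraph)      # graph_from_data_frame()
library(ggraph)      # ggraph(), geom_edge_link(), geom_node_point(), ...
library(Hmisc)       # %nin% ("not in") operator

# Assumption: PDFs saved locally with names like ./pdfs/2022_06_17.pdf
fed_links = list.files("./pdfs", pattern = "\\.pdf$", full.names = TRUE) %>%
  sort()  # zero-padded dates sort chronologically, so the last element is the latest report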
We’ll use pdftools to import the pdf file:
# Importing a single PDF
v=fed_links[length(fed_links)]
fed_import=pdf_text(v)
str(fed_import)
chr [1:77] " For use at 11:00 a.m. EDT\n "| __truncated__ ...
The pdf_text function returns a character vector of
strings, one per page (there are a total of 77 pages in this
report).
Let’s take a look at page 7’s first 500 characters:
substr(fed_import[7], 1, 500)
[1] " 1\n\n\n\n\nSummary\nIn the first part of the year, inflation remained Recent Economic and Financial\nwell above the Federal Open Market Developments\nCommittee’s (FOMC) longer-run objective\nof 2 percent, with some inflation measures Inflation. Consumer price inflation, as\nrising to their highest levels in more than measured by the 12-month change in the\n40 years. These "
As we can see, there’s a good amount of blank space and special
characters like \n indicating line breaks.
We can deal with this by splitting on \n with
strsplit().
v = v %>%
  str_replace_all(c("./pdfs/" = "", "_" = "", ".pdf" = "")) %>%
  as_date() %>%
  format("%b%Y")
# Get the pages and then the line for each page
fed_text_raw = data.frame(
text = fed_import,
stringsAsFactors = FALSE
) %>%
mutate(
page = row_number(),
text = strsplit(text, "\n"),
report = v
) %>% # Separate by line
unnest(text) %>%
group_by(report) %>%
mutate(line = row_number()) %>%
ungroup()
Now we can apply the tidytext mining techniques.
fed_text=fed_text_raw %>%
as_tibble() %>%
tidytext::unnest_tokens(word, text)
# Nice table format
# datatable(fed_text, options = list(autoWidth = TRUE))
Let’s count up the words:
fed_text %>%
count(word, sort = TRUE) %>%
datatable(options = list(autoWidth = TRUE))
There are a lot of common words like: “the”, “of”, and “in”. In text
mining, these words are called “stop words”. We can remove them by using
anti_join and the stop_words list that comes
with the tidytext package.
fed_text %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
datatable(options = list(autoWidth = TRUE))
The Fed really likes talking about rates, but we also find that there are some numbers in the text. The year 2022 appears a lot.
Let’s drop numbers from the text. In older reports, they liked to use fractions with special characters, so we’ll take a heavy-handed approach and keep only alphabetic characters.
fed_text_2=fed_text %>%
mutate(word = gsub("[^A-Za-z ]", "", word)) %>%
filter(word != "")
fed_text_2 %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
datatable(options = list(autoWidth = TRUE))
What’s the overall sentiment of the report? Text mining allows us to
try to score text, or portions of text, for sentiment. We can apply one
of the sentiments datasets supplied by tidytext to score
the report. For this example, we will use the bing lexicon,
based on the work of Bing
Liu and collaborators.
Let’s see what the most frequently used negative and positive words
are based on the bing lexicon.
tbl=fed_text_2 %>%
inner_join(get_sentiments("bing")) %>%
count(word, sentiment, sort = TRUE)
tbl %>%
datatable(options = list(autoWidth = TRUE))
In this report, we can see that “risks”, a negative word, is used 43 times, while “appropriate”, a positive word, is used 33 times. “Debt”, the sixth most frequent word, is classified as negative; in an economic report, however, it may be more descriptive than positive or negative.
We can apply tidytext principles not only to single words but also to
consecutive sequences of words, called n-grams.
fed_bigrams=fed_text_raw %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
as_tibble()
# fed_bigrams %>% datatable(options = list(autoWidth = TRUE))
Count the bigrams:
fed_bigrams %>%
count(bigram, sort = TRUE) %>%
datatable(options = list(autoWidth = TRUE))
As Silge and Robinson point out, many of the bigrams are uninteresting. Let’s filter out uninteresting bigrams that contain stop words.
bigrams_separated=fed_bigrams %>%
separate(
bigram,
c("word1", "word2"),
sep = " "
)
bigrams_filtered = bigrams_separated %>%
  filter(
    word1 %nin% stop_words$word,  # %nin% ("not in") comes from the Hmisc package
    word2 %nin% stop_words$word
  )
bigram_counts=bigrams_filtered %>%
count(word1, word2, sort = TRUE)
bigram_counts %>%
datatable(options = list(autoWidth = TRUE))
# Unite them
bigrams_united = bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
bigrams_united %>%
datatable(options = list(autoWidth = TRUE))
Now, let’s find out whether the Fed used the word “gross” mostly in connection with GDP (gross domestic product).
bigrams_filtered %>%
filter(word1 == "gross") %>%
count(word2, sort = TRUE) %>%
datatable(options = list(autoWidth = TRUE))
After analyzing the report word frequencies, I came up with a list of
words that probably aren’t negative or positive in the usual
sense. I added them to the original stop_words and created
a custom_stop_words2.
custom_stop_words2=bind_rows(
tibble(
word = c("debt", "gross", "crude", "well", "maturity", "work", "marginally", "leverage"),
lexicon = c("custom")
),
stop_words
)
fed_sentiment=fed_text %>%
anti_join(custom_stop_words2) %>%
inner_join(get_sentiments("bing")) %>%
count(report, index = line %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
# fed_sentiment %>% datatable(options = list(autoWidth = TRUE))
ggplot(fed_sentiment, aes(index, sentiment, fill = sentiment > 0)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = c("red","#27408b")) +
facet_wrap(~report, ncol = 5, scales = "free_x") +
theme_ridges(font_family = "Arial") +
labs(
x = "index (approximately 3 pages per unit)",
y = "sentiment",
title = "Sentiment through Federal Reserve Monetary Policy Report",
subtitle = "customized bing lexicon",
caption= "Source: https://www.federalreserve.gov/monetarypolicy/files/20220617_mprfullreport.pdf"
)
This trend tells an interesting story. The text began positive, dropped off, but then surged in the middle. Around the last third of the text, near Part 3: Summary of Economic Projections, sentiment turns negative as the text describes forecasts and risks.
Let’s expand our analysis by capturing the text of each Monetary Policy Report for July from 1996 through 2022. We’ll compare the relative frequency of words and topics and see how sentiment (as we captured it above) varies across reports.
df_fed=fed_links %>%
tibble(
link=.,
report=str_replace_all(., c("./pdfs/"="", "_"="", ".pdf"=""), "") %>% as_date %>% format("%b%Y")
) %>%
mutate(text=future_map(link, pdf_text)) %>%
select(-link) %>%
unnest(text) %>%
group_by(report) %>%
mutate(page = row_number()) %>%
ungroup() %>%
mutate(text = strsplit(text, "\n")) %>%
unnest(text) %>%
group_by(report) %>%
mutate(line = row_number()) %>%
ungroup() %>%
select(report, line, page, text)
# df_fed %>% datatable(options = list(autoWidth = TRUE))
Let’s start with just the number of words per report.
fed_words=df_fed %>%
unnest_tokens(word, text) %>%
count(report, word, sort = TRUE) %>%
ungroup()
total_words = fed_words %>%
group_by(report) %>%
summarize(total = sum(n))
total_words %>%
datatable(options = list(autoWidth = TRUE))
ggplot(data = total_words, aes(x=as.numeric(str_match(report, "[0-9]+")), y = total))+
geom_line(color = "#27408b")+
geom_point(shape = 21, fill = "white", color = "#27408b", size = 3, stroke = 1.1)+
scale_y_continuous(labels = scales::comma)+
theme_ridges(font_family = "Arial")+
labs(
x = "year",
y = "number of words",
title = "Number of words in Federal Reserve Monetary Policy Report",
subtitle = "July of each year 1996-2022",
caption = "Source: Federal Reserve Board Monetary Policy Reports"
)
The Jul2012 report is one of the longer reports, with over 33,145 words. We can also see a pretty clear break at the end of the Greenspan tenure in 2005, as the reports got substantially longer.
Let’s compile a list of the most frequently used words in each report. As before, we’ll omit stop words.
fed_text = df_fed %>%
select(report, page, line, text) %>%
unnest_tokens(word, text)
fed_topic = fed_text %>%
mutate(word = gsub("[^A-Za-z ]", "", word)) %>% # keep only letters (drop numbers and special symbols)
filter(word != "") %>%
anti_join(stop_words) %>%
group_by(report) %>%
count(word, sort = TRUE) %>%
mutate(rank = row_number()) %>%
ungroup() %>%
arrange(rank, report) %>%
filter(rank < 11)
# fed_topic %>% datatable(options = list(autoWidth = TRUE))
# Most Frequent Words
ggplot(fed_topic, aes(y = n, x = fct_reorder(word, n))) +
geom_col(fill = "#27408b") +
facet_wrap(~report, scales = "free", ncol = 5) +
coord_flip() +
theme_ridges(font_family = "Arial", font_size = 10) +
labs(
x = "",
y = "",
title = "Most Frequent Words Federal Reserve Monetary Policy Report",
subtitle = "Excluding stop words and numbers.",
caption = "Source: Federal Reserve Board Monetary Policy Reports"
)
Lots of talking about rates. Let’s see if we can get some more information out of this data.
Following Silge and Robinson, we can use the bind_tf_idf
function to bind the term frequency and inverse document frequency to
our tidy text dataset. This statistic decreases the weight on very
common words and increases the weight on words that appear in only a few
documents. In essence, it extracts the most important information from
each report.
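For reference, the statistic bind_tf_idf computes for a term \(t\) in report \(d\), with \(N\) reports in total and \(n_t\) the number of reports containing \(t\), is:

\[
\mathrm{tf}(t,d)=\frac{\text{count of } t \text{ in } d}{\text{total terms in } d},\qquad
\mathrm{idf}(t)=\ln\frac{N}{n_t},\qquad
\text{tf-idf}(t,d)=\mathrm{tf}(t,d)\times\mathrm{idf}(t)
\]

A term appearing in every report gets \(\mathrm{idf}=\ln 1=0\), which is why ubiquitous words like “rates” are down-weighted no matter how frequent they are.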
We’ll also clean out some additional terms that the
pdftools picked up (like monthly abbreviations) by
augmenting our stop word list.
custom_stop_words = bind_rows(
  tibble(
    word = c(
      tolower(month.abb), "one", "two", "three", "four", "five", "six",
      "seven", "eight", "nine", "ten", "eleven", "twelve", "mam", "ered",
      "produc", "ing", "quar", "ters", "sug", "fmam",
      "cient", "thirty", "pter",
      "pants", "ter", "ening", "ances", "www.federalreserve.gov",
      "tion", "fig", "ure", "figure", "src"
    ),
    lexicon = c("custom")
  ),
  stop_words
)
fed_text_b = fed_text %>%
mutate(word = gsub("[^A-Za-z ]", "", word)) %>%
# keep only letters (drop numbers and special symbols)
filter(word != "") %>%
count(report, word, sort=TRUE) %>%
bind_tf_idf(word, report, n) %>%
arrange(desc(tf_idf))
# Remove the stop words
fed_text_b_filtered = fed_text_b %>%
anti_join(custom_stop_words, by = "word") %>%
mutate(word = factor(word, levels = rev(unique(word)))) %>%
group_by(report) %>%
mutate(id = row_number()) %>%
ungroup() %>%
filter(id < 11)
fed_text_b_filtered %>%
datatable(options = list(autoWidth = TRUE))
# Highest tf-idf Words by Report
ggplot(fed_text_b_filtered, aes(word, tf_idf, fill = report)) +
geom_col(show.legend = FALSE) +
labs(x = NULL, y = "tf-idf") +
facet_wrap(~report, scales = "free", ncol = 5)+
coord_flip()+
theme_ridges(font_family = "Arial", font_size = 10)+
theme(axis.text.x=element_blank())+
labs(
x="",y ="tf-idf",
title="Highest tf-idf words in each Federal Reserve Monetary Policy Report: 1996-2022",
subtitle="Top 10 terms by tf-idf statistic: term frequency and inverse document frequency",
caption="Source: Federal Reserve Board Monetary Policy Reports\nNote: omits stop words, date abbreviations and numbers."
)
This chart tells an interesting story: we can see the emergence of certain acronyms like JGTRRA (Jobs and Growth Tax Relief Reconciliation Act), TALF (Term Asset-Backed Securities Loan Facility), and LFPR (Labor Force Participation Rate). You can also see terms like terrorism (2002) and war (2003) associated with major geopolitical events.
The Monetary Policy Report also contains special topics, and you can see signs of them in some of the reports. For example, the 2016 report has the special topic “Have the Gains of the Economic Expansion Been Widely Shared?”, which discussed economic trends across demographic groups. You can see evidence of that in the prevalence of terms like “hispanic”, “race”, “black”, and “white” in that report.
How did sentiment vary across each report?
fed_sentiment = fed_text %>%
anti_join(custom_stop_words2) %>%
inner_join(get_sentiments("bing")) %>%
count(report, index = line %/% 80, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
fed_sentiment %>%
datatable(options = list(autoWidth = TRUE))
# Sentiment Across the Years
ggplot(fed_sentiment, aes(index, sentiment, fill = sentiment > 0)) +
geom_col(show.legend = FALSE) +
scale_fill_manual(values = c("red","#27408b"))+
facet_wrap(~report, ncol = 5, scales = "free_x")+
theme_ridges(font_family = "Arial")+
labs(
x = "index (approximately 3 pages per unit)",
y = "sentiment",
title = "Sentiment through Federal Reserve Monetary Policy Report",
subtitle = "customized bing lexicon",
caption = "Source: Federal Reserve Board Monetary Policy Reports"
)
This result shows that sentiment tended to be negative between
2001 - 2003 and 2008 - 2009, which were around
the last two economic recessions.
Let’s compute total sentiment by report.
fed_sentiment_2 = fed_text %>%
anti_join(custom_stop_words2) %>%
inner_join(get_sentiments("bing")) %>%
count(report, sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
fed_sentiment_2 %>%
datatable(options = list(autoWidth = TRUE))
# Sentiment by Report
ggplot(
fed_sentiment_2,
aes(factor(str_match(report, "[0-9]+")), sentiment/(negative + positive), fill = sentiment)
) +
geom_col(show.legend = FALSE) +
scale_fill_viridis_c(option = "C") +
theme_ridges(font_family = "Arial", font_size = 10) +
labs(
x = "report for July of each year",
y = "Sentiment (>0 positive, <0 negative)",
title = "Sentiment of Federal Reserve Monetary Policy Report: 1996-2022",
subtitle = "customized bing lexicon",
caption = "Source: Federal Reserve Board Monetary Policy Reports"
)
We can follow Silge and Robinson and construct a graph to visualize word correlations and clusters of words. We’ll compute pairwise word correlations and then construct a graph to represent these correlations.
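Here pairwise_cor() from the widyr package computes the phi coefficient between the binary occurrence indicators of each word pair. For words \(x\) and \(y\) counted over sections,

\[
\phi=\frac{n_{11}\,n_{00}-n_{10}\,n_{01}}{\sqrt{n_{1\cdot}\,n_{0\cdot}\,n_{\cdot 1}\,n_{\cdot 0}}}
\]

where \(n_{11}\) is the number of sections containing both words, \(n_{00}\) the number containing neither, \(n_{10}\) and \(n_{01}\) the counts where only one appears, and \(n_{1\cdot}\), \(n_{0\cdot}\), \(n_{\cdot 1}\), \(n_{\cdot 0}\) are the corresponding row and column totals. The phi coefficient is equivalent to the Pearson correlation applied to binary data.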
# Note: fed_text_2 above covered only the single July 2022 report; to match
# the multi-report caption below, recreate its letters-only cleaning on the
# full 1996-2022 corpus in fed_text.
word_cors = fed_text %>%
  mutate(word = gsub("[^A-Za-z ]", "", word)) %>%
  filter(word != "") %>%
  mutate(section = row_number() %/% 10) %>%
  filter(section > 0) %>%
  filter(word %nin% stop_words$word) %>%
  group_by(word) %>%
  filter(n() >= 20) %>%
  pairwise_cor(word, section, sort = TRUE)
# word_cors %>% datatable(options = list(autoWidth = TRUE))
word_cors_filtered = word_cors %>%
filter(correlation > .15)
graph_from_data_frame(word_cors_filtered) %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
geom_node_point(color ="#27408b", size = 5) +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void(base_family = "Arial")+
labs(
title= "Pairs of words in Federal Reserve Monetary Policy Reports that show at\nleast a 0.15 correlation of appearing within the same 10-line section",
caption= "Source: July Federal Reserve Board Monetary Policy Reports 1996-2022\n"
)